Sentiment classification is an ideal problem in Natural Language Processing (NLP) for getting started. As the name suggests, it is the classification of people's opinions or expressions into different sentiments, such as positive, neutral, and negative.
NLP is a powerful tool, but in the real world we often encounter tasks that suffer from scarce data and poor model generalisation. Transfer learning addresses this problem: a model is trained on a large-scale dataset, and that pretrained model is then used to conduct learning for another downstream task (i.e., the target task).
In this notebook, I am using Sentiment140. It contains two labeled datasets: a training set and a test set.
The data dictionary is as follows:
NOTE: The training data isn't perfectly categorised, as it was created by tagging the text according to the emoji present. So, any model built using this dataset may have lower than expected accuracy.
Let us explore the data for better understanding.
We will only be needing the target and text columns. As observed from the report above, we have only positive (4) and negative (0) sentiments. We will replace 4 with 1 for convenience.
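A minimal sketch of that relabeling with pandas (the toy dataframe below is a made-up stand-in for the real data):

```python
import pandas as pd

# Illustrative stand-in for the Sentiment140 dataframe.
# Column names ("target", "text") follow the data dictionary above.
df = pd.DataFrame({
    "target": [0, 4, 4, 0],
    "text": ["bad day", "great movie", "love it", "awful"],
})

# Keep only the columns we need and map the positive label 4 -> 1.
df = df[["target", "text"]]
df["target"] = df["target"].replace(4, 1)
print(df["target"].tolist())  # -> [0, 1, 1, 0]
```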
Also, it's a perfectly balanced dataset without any skew - an equal distribution of positive and negative sentiments.
At first glance, it's evident that the data is not clean. Tweets often contain user mentions, hyperlinks, emoticons, and special characters, which add no value as features to the model we are training. So we need to get rid of them by performing four crucial preprocessing steps, one by one:
Hyperlinks and Mentions: On Twitter, people can tag/mention other people's IDs and share URLs/hyperlinks. We need to eliminate these.
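A minimal sketch of stripping mentions and hyperlinks with regular expressions (the patterns are simple approximations, not the notebook's exact ones):

```python
import re

# Simple approximations for @mentions and URLs in a tweet.
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def strip_mentions_and_urls(tweet: str) -> str:
    tweet = URL_RE.sub("", tweet)
    tweet = MENTION_RE.sub("", tweet)
    return " ".join(tweet.split())  # collapse leftover whitespace

print(strip_mentions_and_urls("@user check this out https://t.co/abc now"))
# -> "check this out now"
```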
Stopwords: These are commonly used words (such as “the”, “a”, “an”, “in”) which have no contextual meaning in a sentence, and hence we ignore them when indexing entries for searching and when retrieving them as the result of a search query.
Spelling Correction: We can definitely expect incorrect spellings in the tweets/data, and we need to fix as many as possible, because without doing this, the following step will not work properly.
Stemming/Lemmatization: The goal of both stemming and lemmatization is to reduce inflectional and derivationally related forms of a word to a common base form. However, there is a difference which you can understand from the image below.
Lemmatization is similar to stemming, with one difference: the final form is also a meaningful word. Thus, stemming does not need a dictionary the way lemmatization does. Here, we will go ahead with lemmatization.
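To make the contrast concrete, here is a deliberately toy (non-NLTK) illustration: a stemmer chops suffixes by rule and may output non-words, while a lemmatizer consults a vocabulary and always returns a real word:

```python
# Toy sketch, not NLTK: contrasts rule-based stemming with
# dictionary-based lemmatization.

def toy_stem(word: str) -> str:
    # Crude suffix chopping, like a stemmer: output need not be a real word.
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer consults a vocabulary, so the base form is a real word.
TOY_LEMMAS = {"studies": "study", "better": "good", "ran": "run"}

def toy_lemmatize(word: str) -> str:
    return TOY_LEMMAS.get(word, word)

print(toy_stem("studies"))       # -> "stud"  (not a word)
print(toy_lemmatize("studies"))  # -> "study" (a word)
```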
Steps 1, 2 and 4 can be done using the NLTK library, and spell-checking using pyspellchecker.
Some words like not, haven't and don't are included in the stopword list, and ignoring them would give sentences like this was not good and this was good, or He is a nice guy... not! and He is a nice guy... !, the same predictions. So we need to remove the words that express negation, denial, refusal or prohibition from the stopword list, so they survive the cleaning.
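A minimal sketch of stopword removal that preserves negations (the stopword set below is a tiny hand-picked subset; the notebook uses NLTK's full list):

```python
# Tiny illustrative stopword subset; in practice this comes from
# NLTK's stopwords corpus.
STOPWORDS = {"the", "a", "an", "in", "is", "was", "this",
             "not", "no", "don't", "haven't"}
NEGATIONS = {"not", "no", "nor", "don't", "haven't", "isn't", "wasn't"}

# Drop negation words from the stopword set so they survive cleaning.
STOPWORDS -= NEGATIONS

def remove_stopwords(text: str) -> str:
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

print(remove_stopwords("this was not good"))  # -> "not good"
```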
Now, let us define the function to perform the necessary preprocessing.
Now, we will apply the preprocess function to each value of the text column, where the tweets are located.
Let's take a quick look at the words that are frequently used for positive and negative tweets.
We will shuffle the dataset and split it into train, validation and test sets. It's important to shuffle our dataset before training. The split is in the ratio 6:2:2, respectively.
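The shuffle-and-split can be sketched with plain Python (a minimal sketch on a stand-in list; with a pandas DataFrame the same idea is df.sample(frac=1) followed by slicing):

```python
import random

# Stand-in for the preprocessed tweets.
rows = list(range(100))
random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(rows)     # shuffle before splitting

# 6:2:2 split by index.
n = len(rows)
train = rows[: int(0.6 * n)]
val   = rows[int(0.6 * n): int(0.8 * n)]
test  = rows[int(0.8 * n):]

print(len(train), len(val), len(test))  # -> 60 20 20
```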
There are two types of embedding in NLP domain:
WORD EMBEDDING
- ELMo (Embeddings from Language Models)
- BERT (Bidirectional Encoder Representations from Transformers)
- GPT (Generative Pre-Training Transformer)
- ULMFiT (Universal Language Model Fine-Tuning) - this is more of a process that includes word embedding along with an NN architecture.

SENTENCE EMBEDDING
So, the fundamental difference is that Word Embedding turns a word to N-dimensional vector, but the Sentence Embedding is much more powerful because it is able to embed not only words but phrases and sentences as well.
ULMFiT is considered to be the best choice for transfer learning in NLP, but it is built on the fast.ai library, whose code implementation differs from that of Keras or TensorFlow. Hence, for this notebook, we will be using the Universal Sentence Encoder.
It can be used for text classification, semantic similarity, clustering and other natural language tasks. The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs.
It takes variable-length English text as input and outputs a 512-dimensional vector. Handling variable-length input sounds great, but the problem is that the longer a sentence grows (counted in words), the more diluted its embedding can become.
Hence, there are 2 Universal Sentence Encoders to choose from with different encoder architectures to achieve distinct design goals:
Both models were trained with the Stanford Natural Language Inference (SNLI) corpus. The SNLI corpus is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). Essentially, the models were trained to learn the semantic similarity between the sentence pairs.
This model is trained using DAN. DAN works in three simple steps:
The primary advantage of the DAN encoder is that compute time is linear in the length of the input sequence.
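As a rough illustration of those three steps, here is a toy DAN with made-up tiny dimensions (the real encoder outputs 512-dimensional vectors; all names and sizes here are assumptions):

```python
import numpy as np

# Toy vocabulary and 4-d word embeddings (the real model uses far more).
rng = np.random.default_rng(0)
vocab = {"good": 0, "movie": 1, "bad": 2}
emb = rng.normal(size=(3, 4))

def dan_encode(tokens, W1, W2):
    # Step 1: look up the embedding of each word;
    # Step 2: average them (the only token-dependent step, hence linear time);
    # Step 3: pass the average through feed-forward layers.
    avg = emb[[vocab[t] for t in tokens]].mean(axis=0)
    h = np.tanh(W1 @ avg)   # hidden feed-forward layer
    return W2 @ h           # final sentence embedding

W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 4))
vec = dan_encode(["good", "movie"], W1, W2)
print(vec.shape)  # -> (4,)
```

Because averaging is the only step that touches every token, the cost grows linearly with sentence length, matching the advantage stated above.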
This module is about 1GB. Depending on your network speed, it might take a while to load the first time you run inference with it. After that, loading the model should be faster as modules are cached by default.
We have loaded the Universal Sentence Encoder and computing the embeddings for some text can be as easy as shown below.
We have loaded the Universal Sentence Encoder as the variable embed. To make it work with Keras, we wrap it in a Keras Lambda layer and explicitly cast its input to a string. Then we build the Keras model with the standard Functional API. Viewing the model summary, we can see that only the Keras layers are trainable; that is how the transfer learning task works here - the Universal Sentence Encoder weights are left untouched.
Now, let's clear up the confusion between two terms used in deep learning: the loss function and the optimizer.
The loss function is a mathematical way of measuring how wrong the predictions are.
During the training process, we tweak and change the parameters (weights) of the model to try and minimize that loss function, and make the predictions as correct and optimized as possible. But how exactly is it done, by how much, and when?
This is where optimizers come in. They tie together the loss function and model parameters by updating the model in response to the output of the loss function.
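A toy illustration of the two working together - a squared-error loss and plain gradient descent on a single weight (all numbers are made up):

```python
# Loss: how wrong the prediction w*x is, compared to the target y.
def loss(w, x=2.0, y=6.0):
    return (w * x - y) ** 2

# Gradient of the loss with respect to the weight w.
def grad(w, x=2.0, y=6.0):
    return 2 * x * (w * x - y)

w = 0.0
lr = 0.1  # learning rate, an optimizer hyperparameter
for _ in range(50):
    # The optimizer updates the weight in response to the loss's gradient.
    w -= lr * grad(w)

print(round(w, 3))  # -> 3.0, since 3 * 2 = 6 makes the loss zero
```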
Now, we train the model with the training dataset and validate its performance at the end of each training epoch with validation dataset.
Now that we have trained the model, we can evaluate its performance. We will use some evaluation metrics and techniques to test the model.
The learning curves of the model's loss and accuracy at each epoch are shown below:
Finally, let's perform some predictions to see where and why we are getting false positives.
Now, we will perform prediction on the clean test data set provided along with the train data.
We see three categories instead of two; the extra sentiment is neutral, which we haven't trained the model on. Even if we tried to keep it, one-hot encoding would produce three columns, which conflicts with the model's output architecture (which has two). So we have to discard it.
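A minimal sketch of discarding the neutral class, assuming Sentiment140's convention of encoding neutral as 2 (the toy dataframe below is made up):

```python
import pandas as pd

# Toy stand-in for the provided test set, which also contains the
# neutral label 2 that the model was never trained on.
test_df = pd.DataFrame({
    "target": [0, 2, 4, 2, 0],
    "text": ["ugh", "meh", "yay", "ok", "nope"],
})

test_df = test_df[test_df["target"] != 2]            # discard neutral tweets
test_df["target"] = test_df["target"].replace(4, 1)  # match training labels
print(sorted(test_df["target"].unique().tolist()))   # -> [0, 1]
```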
Now, we will evaluate and predict the data with and without all the text preprocessing, and analyze the difference.
Not much of a significant difference. Let's look at some of the outputs to understand where the difference comes from.
The objective of this notebook is to analyze and classify the sentiment of Tweets obtained from Twitter as positive or negative.
After identifying the relevant columns, we performed intensive text preprocessing so the text could be fed to the model without even needing tokenization - a step required in the traditional deep learning approach.
Using Universal Sentence Encoder, which is a state-of-the-art pre-trained sentence embedding module, we contextualized the tweets and created a model that holds the information as to which tweets are referring to a positive sentiment, and which ones are negative.
An interesting observation: despite the encoder's underlying architecture yielding less accuracy than alternatives, the dataset being imperfectly tagged, and the small number of NN layers, we obtain pretty decent accuracy.